fix: pipeline hangs when submitting from compute nodes #450
cmeesters merged 4 commits into snakemake:main from …
Conversation
When running snakemake from within a SLURM job (e.g., an interactive session on a compute node), the pipeline would submit jobs but never detect their completion, hanging forever.

The `RemoteExecutor` base class starts a status-checking daemon thread in `__init__`, before `__post_init__` is called. The SLURM plugin's `warn_on_jobcontext()` in `__post_init__` would sleep 5 seconds and then delete the SLURM environment variables, but by then the daemon thread had already started and would silently die after its first polling cycle.

Fix: move the SLURM environment detection and cleanup into `__init__`, before `super().__init__()` starts the daemon thread. Remove the now-unnecessary `warn_on_jobcontext()` method and its 5-second sleep.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
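For orientation, here is a minimal sketch of the reordering the fix describes. This is not the PR's verbatim diff: the constructor signature, the base-class import path, and the exact set of variables removed are assumptions.

```python
import os

from snakemake_interface_executor_plugins.executors.remote import RemoteExecutor


class Executor(RemoteExecutor):
    def __init__(self, workflow, logger):
        # Detect a surrounding SLURM job context and scrub it *before*
        # RemoteExecutor.__init__ spawns the status-checking daemon thread,
        # so the thread never sees the outer job's SLURM_* variables.
        if "SLURM_JOB_ID" in os.environ:
            logger.warning(
                "Running within a SLURM job context; removing SLURM_* "
                "environment variables so job status checks work."
            )
            # Snapshot the names first; deleting while iterating os.environ
            # directly would raise.
            for var in [v for v in os.environ if v.startswith("SLURM_")]:
                del os.environ[var]
        # Only now run base initialization (which starts the daemon thread).
        super().__init__(workflow, logger)
```

The ordering is the whole point: any cleanup placed in `__post_init__` runs too late, because the base `__init__` has already launched the polling thread.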
🧹 Nitpick comments (1)
tests/test_cli.py (1)
37-50: Please add a regression test that hits `Executor.__init__()`. These tests still build the object with `Executor.__new__()` and call `__post_init__()` directly, so the moved cleanup path in `Executor.__init__()` is never exercised. Please add one test that instantiates `Executor(...)` with `SLURM_JOB_ID` set and patches `RemoteExecutor.__init__()` to assert the environment is already cleaned before base initialization.
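A hedged sketch of what such a regression test could look like. The `MagicMock` constructor arguments and the base-class import path are placeholder assumptions; a real test would reuse the suite's existing helpers (e.g., `_make_executor`), as the comment above suggests.

```python
import os
from unittest import mock

from snakemake_executor_plugin_slurm import Executor
from snakemake_interface_executor_plugins.executors.remote import RemoteExecutor


def test_init_cleans_slurm_env_before_base_init(monkeypatch):
    monkeypatch.setenv("SLURM_JOB_ID", "12345")
    seen = {}

    def checking_init(self, *args, **kwargs):
        # Record whether Executor.__init__ already scrubbed the environment
        # by the time base-class initialization begins. Delegation to the
        # real RemoteExecutor.__init__ is deliberately skipped here so the
        # sketch stays self-contained (the real one would start the
        # status-checking daemon thread).
        seen["cleaned"] = "SLURM_JOB_ID" not in os.environ

    with mock.patch.object(RemoteExecutor, "__init__", checking_init):
        Executor(mock.MagicMock(), mock.MagicMock())

    assert seen["cleaned"], "SLURM_JOB_ID must be gone before RemoteExecutor.__init__"
```

Recording the result in a dict and asserting after the `with` block keeps the failure message readable even though the check itself runs inside the patched constructor.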
Thanks for this PR! At the Snakemake Hackathon I noticed that even when unsetting all …
@jayhesselberth I am actually fine with this PR. Will you apply …? What I meant by my last remark: if you have an order of commands which solves the start-within-job-context issue, I am eager to learn.
@cmeesters in our case, it was a combination of this fix and not having …
cmeesters left a comment:
@jayhesselberth Ok, I will fix the formatting prior to the next release, but will merge it already.
While this definitely seems to make it more robust when started from a compute node, I had my workflow hanging today for 4.5 h without submitting any jobs before I had to manually cancel, unlock, and restart it: …
@gernophil I'm afraid this is probably a different issue. Please write a comprehensive description as a stand-alone issue report.
@cmeesters Thanks for the feedback. Might be related to workflow profile vs. defining resources via CLI. I'll perform some more systematic tests, talk to our Slurm admins, and then come back with a stand-alone issue once I know more.